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We consider the problem of robustly predicting as well as the 
best linear combination of d given functions in least squares regres- 
sion, and variants of this problem including constraints on the pa- 
rameters of the linear combination. For the ridge estimator and the 
ordinary least squares estimator, and their variants, we provide new 
risk bounds of order d/n without logarithmic factor unlike some stan- 
dard results, where n is the size of the training data. We also provide 
a new estimator with better deviations in the presence of heavy-tailed 
noise. It is based on truncating differences of losses in a min-max 
framework and satisfies a d/n risk bound both in expectation and in 
deviations. The key common surprising factor of these results is the 
absence of exponential moment condition on the output distribution 
while achieving exponential deviations. All risk bounds are obtained 
through a PAC-Bayesian analysis on truncated differences of losses. 
Experimental results strongly back up our truncated min-max esti- 
mator. 

1. Introduction. 

Our statistical task. Let Zi = {Xi,Yi), . . . ,Zn = {Xn,Yn) be n > 2 pairs 
of input-output and assume that each pair has been independently drawn 
from the same unknown distribution P. Let X denote the input space and 
let the output space be the set of real numbers R, so that P is a proba- 
bility distribution on the product space Z = X x W. The target of learning 
algorithms is to predict the output Y associated with an input X for pairs 
Z = {X, Y) drawn from the distribution P. The quality of a (prediction) 
function / : — )■ R is measured by the least squares risk: 

R{f)^Ez^p{[Y-f{X)f}. 
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Through the paper, we assume that the output and all the prediction func- 
tions we consider are square integrable. Let O be a closed convex set of W^, 
and ipi,. . . ,{pd be d prediction functions. Consider the regression model 

The best function /* in is defined by 

d 

r = V0*c^jGargmini2(/). 

Such a function always exists but is not necessarily unique. Besides, it is 
unknown since the probability generating the data is unknown. 

We will study the problem of predicting (at least) as well as function /*. In 
other words, we want to deduce from the observations Zi, . . . , Zn a function / 
having with high probability a risk bounded by the minimal risk R{f*) on 
plus a small remainder term, which is typically of order d/n up to a possible 
logarithmic factor. Except in particular settings (e.g., is a simplex and 
d > \/n), it is known that the convergence rate d/n cannot be improved in 
a minimax sense (see [11] and [12] for related results). 

More formally, the target of the paper is to develop estimators / for 
which the excess risk is controlled in deviations, that is, such that for an 
appropriate constant n> 0, for any e > 0, with probability at least 1 — e, 

(1.1) Rif)-Rin<^^^^^^^^^. 

n 

Note that by integrating the deviations [using the identity E(VF) = 
Jq F{W > t) dt which holds true for any non- negative random variable W\ , 
inequality (1.1) implies 

(1.2) Ei?(/)-^(r)<^^^^^. 

n 

In this work, we do not assume that the function 

which minimizes the risk R among all possible measurable functions, belongs 
to the model F. So we might have /* 7^ /(^'^s) and in this case, bounds of 
the form 

(1.3) Ei?( /) - i2(/('-<=g)) < C\Rin - /2(/(''^g))] + K- 

n 

with a constant C larger than 1, do not even ensure that 'ER{f) tends 
to R{f*) when n goes to infinity. These kinds of bounds with C > 1 have been 
developed to analyze nonparametric estimators using linear approximation 
spaces, in which case the dimension d is a function of n chosen so that the 
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bias term R{f*) — R{f^^'^^^) has the order d/n of the estimation term (see 
[3, 6, 10] and references within). Here we intend to assess the generahzation 
abiUty of the estimator even when the model is misspecified [namely, when 
R{f*) > Moreover, we do not assume either that Y - /('^'=s)(X) 

and X are independent or that Y has a subexponential tail distribution: for 
the moment, we just assume that Y — f*{X) admits a finite second-order 
moment in order that the risk of /* is finite. 

Several risk bounds with C = 1 can be found in the literature. A survey 
on these bounds is given in [1], Section 1. Let us mention here the closest 
bound to what we are looking for. From the work of Birge and Massart [4], 
we may derive the following risk bound for the empirical risk minimizer on 
a L°° bah (see Appendix B of [1]). 

Theorem 1.1. Assume that T has a diameter H for L"^ -norm, that is, 
for any /i,/2 in T , sup^g;^|/i(x) — /2(x)| <}i and there exists a function 
foG J' satisfying the exponential moment condition 

(1.4) foranyxeX E{exp[A-^\Y - fo{X)\]\X = x} < M 

for some positive constants A and M. Let 

5 . r \\Y^'j = lSj4'j\\'L 

^ = ^ """"p — iim — ' 

<?'i,-,0deeRd-{o} Irlloo 

where the infimum is taken with respect to all possible orthonormal bases 
of T for the dot product (/i,/2) i— )■ ]E[/i(X)/2(X)] (when the set T admits 
no basis with exactly d functions, we set B = +oo ). Then the empirical risk 
minimizer satisfies for any e > 0, with probability at least 1 — e, 

^(j(crm)^ _ ^^j*^ < ^^^2 ^ ^2^f^log[2 + {B/n) A {n/d)] + log(e-i) 



n 

where k is a positive constant depending only on M . 

The theorem gives exponential deviation inequalities of order at worse 
d\og{n/d)/n and, asymptotically, when n goes to infinity, of order d/n. This 
work will provide similar results under weaker assumptions on the output 
distribution. 

Notation. When Q = W^, the function /* and the space F will be writ- 
ten /jjjj and J-\\a to emphasize that J- is the whole linear space spanned by 

ipi,...,LPcl- 

Jiin = span{99i , . . . , and G arg min i?(/) . 

The Euclidean norm will simply be written as || • ||, and (•, •) will be its asso- 
ciated inner product. We will consider the vector valued function 99 : Af — )• R"^ 
defined by ^p{X) = [ipkiX)]^^^, so that for any G 0, we have 

fe{X) = {e,^{X)). 
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The Gram matrix is the d x d-matrix Q = ¥.[ip{X)ip{X)'^]. The empirical 
risk of a function / is r(/) = ^ J2^=i[fi'^i) ~ -^i]^ ^^"^ A > 0, the ridge 
regression estimator on T is defined by /("^^s^) = /^(Hdgo) with 

^("'^s^) Gargmin{r(/0) + A||^f}, 
6»g0 

where A is some non-negative real parameter. In the case when A = 0, the 
ridge regression fi^^'^s^) nothing but the empirical risk minimizer yC'^^'™). 
Besides, the empirical risk minimizer when = is also called the ordinary 
least squares estimator, and will be denoted by f^°^^\ 

In the same way, we introduce the optimal ridge function optimizing the 
expected ridge risk: f = fg with 

(1.5) ^Gargmin{i?(^) + A||0f}. 

See 

Finally, let Q\ = Q + XI be the ridge regularization of Q, where / is the 
identity matrix. 

Why should we be interested in this task? There are four main reasons. 
First, we intend to provide a nonasymptotic analysis of the parametric lin- 
ear least squares method. Second, the task is central in nonparametric es- 
timation for linear approximation spaces (piecewise polynomials based on 
a regular partition, wavelet expansions, trigonometric polynomials. . .). 

Third, it naturally arises in two-stage model selection. Precisely, when 
facing the data, the statistician often has to choose several models which are 
likely to be relevant for the task. These models can be of similar structure 
(like embedded balls of functional spaces) or, on the contrary, of a very 
different nature (e.g., based on kernels, splines, wavelets or on a parametric 
approach). For each of these models, we assume that we have a learning 
scheme which produces a "good" prediction function in the sense that it 
predicts as well as the best function of the model up to some small additive 
term. Then the question is to decide on how we use or combine/aggregate 
these schemes. One possible answer is to split the data into two groups, 
use the first group to train the prediction function associated with each 
model, and finally use the second group to build a prediction function which 
is as good as (i) the best of the previously learned prediction functions, 
(ii) the best convex combination of these functions or (iii) the best linear 
combination of these functions. This point of view has been introduced by 
Nemirovski in [8] and optimal rates of aggregation are given in [11] and the 
references within. This paper focuses more on the linear aggregation task 
[even if (ii) enters in our setting], assuming implicitly here that the models 
are given in advance and are beyond our control and that the goal is to 
combine them appropriately. 

Finally, in practice, the noise distribution often departs from the nor- 
mal distribution. In particular, it can exhibit much heavier tails, and con- 
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sequently induce highly non-Gaussian residuals. It is then natural to ask 
whether classical estimators such as the ridge regression and the ordinary 
least squares estimator are sensitive to this type of noise, and whether we 
can design more robust estimators. 

Outline and contributions. Section 2 provides a new analysis of the ridge 
estimator and the ordinary least squares estimator, and their variants. The- 
orem 2.1 provides an asymptotic result for the ridge estimator, while Theo- 
rem 2.2 gives a nonasymptotic risk bound for the empirical risk minimizer, 
which is complementary to the theorems put in the survey section. In par- 
ticular, the result has the benefit to hold for the ordinary least squares esti- 
mator and for heavy-tailed outputs. We show quantitatively that the ridge 
penalty leads to an implicit reduction of the input space dimension. Sec- 
tion 3 shows a nonasymptotic d/n exponential deviation risk bound under 
weak moment conditions on the output Y and on the (i-dimensional input 
representation <f{X). 

The main contribution of this paper is to show through a PAC-Bayesian 
analysis on truncated differences of losses that the output distribution does 
not need to have bounded conditional exponential moments in order for 
the excess risk of appropriate estimators to concentrate exponentially. Our 
results tend to say that truncation leads to more robust algorithms. Lo- 
cal robustness to contamination is usually invoked to advocate the removal 
of outliers, claiming that estimators should be made insensitive to small 
amounts of spurious data. Our work leads to a different theoretical expla- 
nation. The observed points having unusually large outputs when compared 
with the (empirical) variance should be down-weighted in the estimation 
of the mean, since they contain less information than noise. In short, huge 
outputs should be truncated because of their low signal-to-noise ratio. 

2. Ridge regression and empirical risk minimization. We recall the def- 
inition 



where is a closed convex set, not necessarily bounded (so that O = M'^ 
is allowed). In this section we provide exponential deviation inequalities for 
the empirical risk minimizer and the ridge regression estimator on under 
weak conditions on the tail of the output distribution. 

The most general theorem which can be obtained from the route followed 
in this section is Theorem 1.5 of the supplementary material [2]. It is ex- 
pressed in terms of a series of empirical bounds. The first deduction we can 
make from this technical result is of an asymptotic nature. It is stated under 
weak hypotheses, taking advantage of the weak law of large numbers. 
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Theorem 2.1. For A > 0, let f be its associated optimal ridge function 
[see (1-5)]. Let us assume that 

(2.1) E[||^(X)r]<+oo 
and 

(2.2) E{||<^(X)f[/(X)-y]2}<+oo. 

Let vi> ■ ■ ■ > Ud be the eigenvalues of the Gram matrix Q = K[ip{X)ip{X)'^], 
and let Q\ = Q + XL be the ridge regularization of Q. Let us define the 
effective ridge dimension 



When A = 0, D is equal to the rank of Q and is otherwise smaller. For any 
e > 0, there is n^, such that for any n > n^, with probability at least 1 — £, 

^(^>idgc))^^||glridge)||2 

<unn{R{fe) + X\\9f} 

^mE{\\Q",'/'^{X)r[f{X)-Y]^}D 



+ 1,000 sup 



E{\\Q-'/\{XW} 

E[{v,^{X)nf{X)-Y]^] log(3e-i: 



E{{v,ip{X))'^) + X\\v\\^ n 



<min{i?(/e) + A||0f} 



+ esssupE{[y-/(X)]^|X} 



2,-,. 30L> + 1,000 log(3e-i) 



n 

Proof. See Section 1 of the supplementary material [2]. □ 

This theorem shows that the ordinary least squares estimator (obtained 
when = M'^ and A = 0), as well as the empirical risk minimizer on any 
closed convex set, asymptotically reaches a d/n speed of convergence under 
very weak hypotheses. It shows also the regularization effect of the ridge 
regression. There emerges an effective dimension D, where the ridge penalty 
has a threshold effect on the eigenvalues of the Gram matrix. 

Let us remark that the second inequality stated in the theorem provides 
a simplified bound which makes sense only when 

esssupE{[y - f{X)f\X} < +00 

implying that \\f — Z'-'^'^^'* ||oo < +oo. We chose to state the first inequality as 
well, since it does not require such a tight relationship between / and /(''°s). 
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On the other hand, the weakness of this result is its asymptotic nature: 
may be arbitrarily large under such weak hypotheses, and this happens even 
in the simplest case of the estimation of the mean of a real- valued random 
variable by its empirical mean [which is the case when d=l and ^{X) = 1]. 

Let us now give some nonasymptotic rate under stronger hypotheses and 
for the empirical risk minimizer (i.e., A = 0). 

Theorem 2.2. Assume that E{[Y - f*{X)]^} < +oo and 
B= sup ||/||^/E[/(X)2]<+oo. 

/Gspan{(pi,...,</3d}-{0} 

Consider the (unique) empirical risk minimizer jC^^m) _ f^^^^^^ )■ {0^^^™\ 
^{x)) onJ^/orw/iic/i^(^™) espan{¥?(Xi),...,^(X„)}.i For any values of e 
and n such that 2/n <e <1 and 

WB^d^- 



n 



> 1280-B^ 



3Bd + \og{2/e) 



n 



with probability at least I — £, 
(2.3) 



< 1920B^/E{[Y - f*{X)]'^} 



3Bd + log(2e-i) (ABdV 



+ 



n 



Proof. See Section 1 of the supplementary material [2]. □ 

It is quite surprising that the traditional assumption of uniform bounded- 
ness of the conditional exponential moments of the output can be replaced 
by a simple moment condition for reasonable confidence levels (i.e., e > 2/n). 
For highest confidence levels, things are more tricky since we need to control 
with high probability a term of order \r{f*) — R{f*)]d/n (see Theorem 1.6). 
The cost to pay to get the exponential deviations under only a fourth-order 
moment condition on the output is the appearance of the geometrical quan- 
tity B as a multiplicative factor. 

To better understand the quantity B, let us consider two cases. First, 
consider that the input is uniformly distributed on X = [0, 1] , and that the 
functions ipi,...,ip[i belong to the Fourier basis. Then the quantity B be- 
haves like a numerical constant. On the contrary, if we take ipi, . . . ,(pd as the 
first d elements of a wavelet expansion, the more localized wavelets induce 
high values of B, and B scales like y/d, meaning that Theorem 2.2 fails to 
give a d/n-excess risk bound in this case. This limitation does not appear 
in Theorem 2.1. 



^When T = -Fiin, we have O^'"''^'' = X+Y, with X = (v?, (A:0)i<,<„,i<j<d, Y = KJ^^i 
and X"*" is the Moore-Penrose pseudoinverse of X. 
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To conclude, Theorem 2.2 is hmited in at least four ways: it involves the 
quantity B, it applies only to uniformly bounded (p{X), the output needs to 
have a fourth moment, and the confidence level should be as great as e > 2/n. 
These limitations will be addressed in the next section by considering a more 
involved algorithm. 

3. A min max estimator for robust estimation. This section provides 
an alternative to the empirical risk minimizer with nonasymptotic expo- 
nential risk deviations of order d/n for any confidence level. Moreover, we 
will assume only a second-order moment condition on the output and cover 
the case of unbounded inputs, the requirement on (p{X) being only a finite 
fourth-order moment. On the other hand, we assume here that the set Q 
of the vectors of coefficients is bounded. The computability of the proposed 
estimator and numerical experiments are discussed at the end of the section. 

3.1. The min-max estimator and its theoretical guarantee. Let a > 0, 
A > 0, and consider the truncation function: 

{-log(l-x + a;V2), < x < 1, 
log(2), x>l, 
— x), a; < 0. 

For any 6,6' gQ, introduce 

n 

V{e,e') = naX{\\6f - \\6'f) + Y^^l;{a[Yi - fe{Xi)f - a[Yi - fe'iX,)]^). 

i=l 

We recall that f = f§ with 6 G argmineg0{i?(/0) + A||^|p}, and that the 
effective ridge dimension is defined as 

d 

D = n\Q~x'\{X)r] = Tr[(Q + A/)-iQ] = J] ^t{u, > 0) < d, 

1=1 

where vi > ■ ■ ■ > Vd are the eigenvalues of the Gram matrix Q = 
E[ip{X)ip{Xf]. Let us assume in this section that 

(3.1) E{{Y-f{X)]'}<+oo, 
and that for any j £ {1, . . . ,d}, 

(3.2) E[v9j(X)^] <+oo. 
Define 

(3.3) 5 = {/G Jiin:E[/(X)2] = l}, 



(3.4) a = JE{[Y-fiXW} = jR{f) 
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(3.5) x = maxVE[/(X)4], 



E{[<^(X)^Q-V(X)]2} 

nv{x)TQi\{x)] 



(3.6) 



(3.7) K = ^ = ^ , 

E{[Y-f{XW} ct2 



(3.8) T = ^^max^ ^All^ - ^'P + mfeiX) - fe'{X)]^}. 



Theorem 3.1. Let us assume that (3.1) and (3.2) hold. For some nu- 
merical constants c and d , for 

n > CKxD 

by taking 

(3.9) a = ^ (,_2!^\ 



for any estimator fg satisfying 9 £ Q a.s., for any e > and any A > 0, with 
probability at least 1 — e, we have 

Rifg) + X\\ef<mm{Rife) + X\\ef} 

-\ (max'D{9,9i) — inf maxPC" " 



na \6»ige 6»G06»iGe / n 



~^ n ' rfi J 1 — cKxD/n 

Proof. See Section 2 of the supplementary material [2]. □ 
By choosing an estimator such that 

maxP(^,6li) < inf maxl)(9,9i) +cr^ — , 

0166 6lG0 6»ig0 n 

Theorem 3.1 provides a nonasymptotic bound for the excess (ridge) risk with 
a D/n convergence rate and an exponential tail even when neither the out- 
put Y nor the input vector (p{X) have exponential moments. This stronger 
nonasymptotic bound compared to the bounds of the previous section comes 
at the price of replacing the empirical risk minimizer by a more involved es- 
timator. Section 3.3 provides a way of computing it approximately. 

Theorem 3.1 requires a fourth-order moment condition on the output. In 
fact, one can replace (3.1) by the following second-order moment condition 
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on the output: for any j G {1, . . . , d}, 

E{(^,-(X)2[y-/(X)]2}<+00, 

and still obtain a D/n excess risk bound. This comes at the price of a more 
lengthy formula, where terms with k' become terms involving the quantities 
max;e5lE{/(^)'[^ - /(^)]'} and EMXfQ-^^{X)[Y - f{X)]^}. (This 
can be seen by not using Cauchy-Schwarz's inequality in (2.5) and (2.6) of 
the supplementary material [2].) 

3.2. The value of the uncentered kurtosis coefficients x o^n-d We see 
that the speed of convergence of the excess risk in Theorem 3.1 (page 9) 
depends on three kurtosis-like coefficients, X) and n' . The third, k' , is 
concerned with the noise, conceived as the difference between the observed 
output Y and its best explanation f{X) according to the ridge criterion. 
The aim of this section is to study the order of magnitude of the two other 
coefficients x and k, which are related to the design distribution, 

X = sup{Ei{u, ^(X))4)i/2; G E((n, ^{X)f) < 1} 

and 

K = D-'E{\\Q-'/'^{X)f)'/'. 
We will review a few typical situations. 

3.2.1. Gaussian design. Let us assume first that ^p{X) is a multivari- 
ate centered Gaussian random variable. In this case, its covariance matrix 
coincides with its Gram matrix Qq and can be written as 

Qo = C/"^Diag(i/j,i = 1, . . .,n)U, 

"1/2 

where U is an orthogonal matrix. Using [/, we can introduce W = UQ^ '^{X). 
It is also a Gaussian vector, with covariance Diag[fj/(A + Vi),i = 1, . . . 

— 1/2 

Moreover, since U is orthogonal, \\W\\ = \\Q^ ' (f{X)\\, and since (Wi,VFj) 
are uncorrelated when i^ they are independent, leading to 



E(||Q-^/V(^)r) = E 



d 

E- 

i=l ^<i<j<d 
->2 



E{Wt) + 2 ^ E{Wf)E{Wf) 



= D' + 2D2, 
where D2 = Yli=i (A+t-)^ • Thus, in this case. 
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Moreover, as for any value of n, {u,ip{X)) is a Gaussian random variable, 

This situation arises in compressed sensing using random projections on 
Gaussian vectors. Specifically, assume that we want to recover a signal / G 
M*^ that we know to be well approximated by a linear combination of d basis 
vectors /i, . . . , /rf. We measure n<^M projections of the signal / on i.i.d. M- 
dimensional standard normal random vectors Xi, . . . , X„ :¥{ = {f,Xi), i = 
1, . . . , n. Then, recovering the coefficient 6i,. . . ,6d such that / = Yl'j=i ^jfj 
is associated to the least squares regression problem, ^ ~ X]j=i ^'/'il^); 
with ipj{x) = {fj,x), and X having a M-dimensional standard normal dis- 
tribution. 

3.2.2. Independent design. Let us study now the case when almost surely 
fi{X) = 1 and if2{X), . . . ,ipdiX) are independent. To compute Xi we can 
assume without loss of generality that ip2{X), . . . , ^Pd{X) are centered and of 
unit variance, since this renormalization is precisely the linear transforma- 
tion that turns the Gram matrix into the identity matrix. Let us introduce 

with the convention ^ = 0. A computation similar to the one made in the 
Gaussian case shows that 



Moreover, for any u G M'^ such that ||u|| = 1, 
d 

E((n,(^(X))4) = 5^ntE(^,(X)4) + 6 ^ u^u]E[^i{Xfnv,{X)'] 

i=l ^<i<j<d. 
d 

+ 4^niufE[v9i(X)3] 

1=2 

i=l i<j 1=2 



ll"ll=l 1=1 \i=l / 1=2 



< sup {xt-^)}^< + ^2^ut ] +Ax:'^ui2^uf 

33/2 (xl X*>3, 
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Thus, in this case, 



X< { 



X* 1 + 



33/2 \ 1/2 



3 + ^X^/^ + « 



d 



1/2 



X* > V3, 
1 < X* < 



If, moreover, the random variables ip2iX), . . . ,ip(i{X) are not skewed, in 
the sense that E[93j(X)^] =0, j = 2, . . . ,d, then 

X* > \/3, 



X = X*, 



X< 3 + 



d 



1/2 



1 < X* < V3. 



3.2.3. Bounded design. Let us assume now that the distribution of ^p{X) 
is almost surely bounded and nearly orthogonal. These hypotheses are suited 
to the study of regression in usual function bases, like the Fourier basis, 
wavelet bases, histograms or splines. 

More precisely, let us assume that P(||(/9(X)|| < B) = 1 and that for some 
positive constant A and any n E M'^, 

\\u\\<AE[{u,ip{X)f]^/^. 

This appears as some stability property of the partial basis ipj with respect 
to the L2-norm, since it can also be written as 

w \ 2n 



^u]<A'E 



u G . 



In terms of eigenvalues, A~'^ can be taken to be the lowest eigenvalue i^d of 
the Gram matrix Q. The value of A can also be deduced from a condition 
saying that (pj are nearly orthogonal in the sense that 

E[ipj{Xf]>l and \E[ipj{X)MX)]\<^^^- 
In this situation, the chain of inequalities 

E[{u,ip{X))'^] < \\ufB^E[{u,y^{X))^] < A^B^E[{u,ip{X)ff 
shows that x ^ AB. On the other hand, 

E[iiQ^^/V(^)r] 

= E[sup{{u,ip{X))^;ueR'^,\\Q]!\\\ < 1}] 



<E[snp{\\ufB^{u,^{X)r;\\Q''^^u\\<l}] 



V2, 



ROBUST LINEAR LEAST SQUARES REGRESSION 



13 



<n^nv{{l + \A^r^A^B^\\Q^^\f{u,^{X)f-\\Q^^\\\<l}] 

A? „r,,„ — 1/2 / niiOt D 

showing that k < , . 

- V(i+AA2)D 

For example, if X is the uniform random variable on the unit interval 
and (fj , j = 1, . . . ,d, are any functions from the Fourier basis [meaning that 
they are of the form -v/2cos(2/c7rX) or -v/2sin(2A;7rX)], then A = l (because 
they form an orthogonal system) and B < v2d. 

A localized basis like the evenly spaced histogram basis of the unit interval 

^pj{x) = Vdi{x e [{j - i)MiM), j = i,...,d, 

will also be such that A = 1 and B = ^/d. Similar computations could be 
made for other local bases, like wavelet bases. 

Note that when x is of order ^fd, and k and k' of order 1, Theo- 
rem 3.1 means that the excess risk of the min-max truncated estimator / 
is upper bounded by Cd/n provided that n > Cd^ for a large enough con- 
stant C. 

3.2.4. Adaptive design planning. Let us discuss the case when X is some 
observed random variable whose distribution is only approximately known. 
Namely, let us assume that (vj)j=i is some basis of functions in L2[P] with 

some known coefficient where P is an approximation of the true distribu- 
tion of X in the sense that the density of the true distribution P of X with 
respect to the distribution P is in the range {r]~^,T]). In this situation, the 
coefficient x satisfies the inequality x < V'^^'^X- Indeed, 

<VX^E^^^[{u,ip{X)ff 

<v''x'^xM{u,^{X)f]'. 
In the same way, k < rf/'^R. Indeed, 

E[sup{(7x,99(X))^E((n,(^(X))2) < 1}] 

< r/E[sup{(tx,(/.(X))4;]E((^z,99(X))2) < ry}] 

< 77']E[sup{(n, ^(X))^ ]E((n, V9(X))2) < 1}] 
<r/3«2]E[sup{(n,^(X))2;]E((n,^(X))2)<l}]2 

<rfR^n^nv{{u,^{X))^-n{u,^{X))^)<l]f. 

Let us conclude this section with some scenario for the case when X is 
a real- valued random variable. Let us consider the distribution function of P, 

F{x)=f{X<x). 
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Then, if P has no atoms, the distribution of F[X) would be uniform on 
(0, 1) if X were distributed according to P. In other words, P o = U, 
the uniform distribution on the unit interval. Starting from some suitable 
partial basis of L2[(0,1),U] like the ones discussed above, we can 

build a basis for our problem as 

^j{X) = ip,[F{X)]. 

Moreover, if P is absolutely continuous with respect to P with density g, then 
P o is absolutely continuous with respect to P o = U, with density 
go F~^, and, of course, the fact that g takes values in {r]~^,7]) implies the 
same property for goF~^ . Thus, if x and k are the coefficients corresponding 
to fj{U) when U is the uniform random variable on the unit interval, then 
the true coefficient x [corresponding to if>j{X)] will be such that x ^ v'^^'^X 
and K < ifl'^'k. 

3.3. Computation of the estimator. For ease of description of the algo- 
rithm, we will write X for (p{X), which is equivalent to considering without 
loss of generality that the input space is M"^ and that the functions ipi, . . . ,ipci 
are the coordinate functions. Therefore, the function fg maps an input x to 
{9,x). Let us introduce 

Li{e) = a{{6,X,)-Yi)\ 
For any subset of indices I C {1, . . . , n}, let us define 

rj{9) = x\\ef + ^y^Liie). 

a\I\ ^-^ 

We suggest the following heuristics to compute an approximation of 

arg min sup ViO^O'): 

• Start from /i = { 1 , . . . , n} with the ordinary least squares estimate 

9i = arg min r/j . 

• At step number k, compute 

• Consider the sets 

where ^ is the (pseudo-)inverse of the matrix Q^. 
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• Let us define 

^'fc.iO?) = argmin rj^.^(^), 

Rd 

JkM = {i^h- \Li{Ok,i{r])) -Li{Ok)\ < 1}, 
OkMv) = argminrj^,2(r,), 

(%,4)= argmin max V{ek/i7]),Bj), 

r;GK+/G{l,2}J=lv-,fc 
h+l = Jk/kiVk), 

• Stop when 

max V{0kj^i,6j) > 0, 

j=l,...,k 

and set 9 = 9k as tlie final estimator of 9. 

Note that there will be at most n steps, since Ik+i ^ Ik and in practice much 
less in this iterative scheme. Let us give some justification for this proposal. 
Let us notice first that 

V{9 + h,9) 

= naX{\\9 + hf - \\9f) 

n 

+ Y,^{a[2{h,Xi){{9,Xi) - y,) + {h,Xif]). 

i=l 

Hopefully, 9 = aTgmmg^^d{R{fe) + A||0|p) is in some small neighborhood 
of 9k already, according to the distance defined by Q ^ Qk- So we may 
try to look for improvements of 9^ by exploring neighborhoods of 9^ of 
increasing sizes with respect to some approximation of the relevant norm 

\\9\\i=m,xn 

Since the truncation function ip is constant on (— oo,— 1] and [l,+oo), 
the map 9 i— )■ T){9, 9^) induces a decomposition of the parameter space into 
cells corresponding to different sets / of examples. Indeed, such a set / is 
associated to the set Cj of 9 such that Li{9) — Li{9k) < 1 if and only if i G /. 
Although this may not be the case, we will do as if the map 9 i— )■ 'D{9,9k) 
restricted to the cell Cj reached its minimum at some interior point of Cj, 
and approximates this minimizer by the minimizer of rj. 

The idea is to remove first the examples which will become inactive in 
the closest cells to the current estimate 9k- The cells for which the contribu- 
tion of example number i is constant are delimited by at most four parallel 
hyper planes. 
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It is easy to see that the square of the inverse of the distance of 9k to the 
closest of these hyperplanes is equal to 

^1/2 

Indeed, this distance is the infimum of \\Q}J ^||, where is a solution of 

{h,x^f + 2{h,Xi){{ek,Xi) - y,) = -. 

a 

It is computed by considering h of the form h = ^\\Qi^^^'^Xi\\~^Q^^Xi and 
solving an equation of order two in ^. 

This explains the proposed choice of Jk,ii'n)- Then a first estimate ^a;,i(??) 
is computed on the basis of this reduced sample, and the sample is read- 
justed to Jfc,2(^) by checking which constraints are really activated in the 
computation of T){0k^i{ii),0k)- The estimated parameter is then readjusted, 
taking into account the readjusted sample (this could as a variant be iterated 
more than once). Now that we have some new candidates Ok/{r]), we check 
the minimax property against them to elect Ik+i and 9k+i- Since we did 
not check the minimax property against the whole parameter set = M'^, 
we have no theoretical warranty for this simplified algorithm. Nonetheless, 
similar computations to what we did could prove that we are close to solving 
minj=i^ ), since we checked the minimax property on the reduced 

parameter set {Oj,j = 1, . . . , A;}. Thus, the proposed heuristics are capable of 
improving on the performance of the ordinary least squares estimator, while 
being guaranteed not to degrade its performance significantly. 

3.4. Synthetic experiments. In Section 3.4.1, we detail the different kinds 
of noises we work with. Then, Sections 3.4.2, 3.4.3 and 3.4.4 describe the 
three types of functional relationships between the input, the output and the 
noise involved in our experiments. A motivation for choosing these input- 
output distributions was the ability to compute exactly the excess risk, and 
thus to compare easily estimators. Section 3.4.5 provides details about the 
implementation, its computational efficiency and the main conclusions of the 
numerical experiments. Figures and tables are postponed to the Appendix. 

3.4.1. Noise distributions. In our experiments, we consider different ty- 
pes of noise that are centered and with unit variance: 

• the standard Gaussian noise, VF~A/'(0, 1), 

• a heavy-tailed noise defined by = sign(y)/|l/|V9, with 1/~AA(0,1), 
a standard Gaussian random variable and q = 2.01 (the real number q is 
taken strictly larger than 2 as for q = 2, the random variable W would 
not admit a finite second moment). 
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an asymmetric heavy-tailed noise defined by 

'\V\-^/i, ifF>0, 



W-- 



otherwise, 



with q = 2.01 with V A/'(0, 1) a standard Gaussian random variable. 
• a mixture of a Dirac random variable with a low- variance Gaussian ran- 
dom variable defined by, with probability p, W = y^(l — p) /p, and with 
probability 1 — p, W is drawn from 



VRi-p) p p(i-p) 



1 — p '1— p (1— p)^ 

The parameter p G [p, 1] characterizes the part of the variance of W ex- 
plained by the Gaussian part of the mixture. Note that this noise admits 
exponential moments, but for n of order 1/p, the Dirac part of the mixture 
generates low signal-to-noise points. 

3.4.2. Independent normalized covariates jlNC{n,d)]. In INC(n,(i), we 
consider (p{X) = X , and the input-output pair is such that 

Y = {e*,X) + aW, 

where the components of X are independent standard normal distributions, 
6* = (10, . . . , 10)^ G M'^ and a = 10. 

3.4.3. Highly correlated covariates [HCC{n,d)]. In HCC(n,(i), we con- 
sider (p{X) = X , and the input-output pair is such that 

Y = {e*,X) +aW, 

where X is a multivariate centered normal Gaussian with covariance ma- 
trix Q obtained by drawing a {d, d)-matrix A of uniform random variables 
in [0, 1] and by computing Q = AA^ , 9* = (10, . . . , 10)^ G M'^ and a = 10. 
So the only difference with the setting of Section 3.4.2 is the correlation 
between the covariates. 

3.4.4. Trigonometric series [TS{n, d) ]. Let X be a uniform random vari- 
able on [0,1]. Let d be an even number. In TS(n,(i), we consider 

ip{X) = (cos(27rX), . . . ,cos(d7rX),sin(27rX), . . . , sin(d7rX))^, 

and the input-output pair is such that 

Y = 20^2 - lOX -l + aW 

with cj = 10. One can check that this implies 
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3.4.5. Experiments. 

Choice of the parameters and implementation details. The min-max trun- 
cated algorithm has two parameters a and A. In the subsequent experiments, 
we set the ridge parameter A to the natural default choice for it: A = 0. For 
the truncation parameter a, according to our analysis [see (3.9)], it roughly 
should be of order l/o"^ up to kurtosis coefficients. By using the ordinary 
least squares estimator, we roughly estimate this value, and test values of a 
in a geometric grid (of 8 points) around it (with ratio 3). Cross-validation 
can be used to select the final a. Nevertheless, it is computationally expen- 
sive and is significantly outperformed in our experiments by the following 
simple procedure: start with the smallest a in the geometric grid and in- 
crease it as long as 6 = 6i, that is, as long as we stop at the end of the first 
iteration and output the empirical risk minimizer. 

To compute ^fc,i(7?) or 2(^)1 one needs to determine a least squares 
estimate (for a modified sample). To reduce the computational burden, we 
do not want to test all possible values of rj (note that there are at most n 
values leading to different estimates). Our experiments show that testing 
only three levels of rj is sufficient. Precisely, we sort the quantity 



by decreasing order and consider ij being the first, 5th and 25th value of 
the ordered list. Overall, in our experiments, the computational complexity 
is approximately fifty times larger than the one of computing the ordinary 
least squares estimator. 

Results. The tables and figures have been gathered in the Appendix. Ta- 
bles 1 and 2 give the results for the mixture noise. Tables 3, 4 and 5 provide 
the results for the heavy-tailed noise and the standard Gaussian noise. Each 
line of the tables has been obtained after 1,000 generations of the train- 
ing set. These results show that the min-max truncated estimator is often 
equal to the ordinary least squares estimator while it ensures impres- 

sive consistent improvements when it differs from f^°^^\ In this latter case, 
the number of points that are not considered in /, that is, the number of 
points with low signal-to-noise ratio, varies a lot from 1 to 150 and is often 
of order 30. Note that not only the points that we expect to be considered 
as outliers (i.e., very large output points) are erased, and that these points 
seem to be taken out by local groups: see Figures 1 and 2 in which the erased 
points are marked by surrounding circles. 

Besides, the heavier the noise tail is (and also the larger the variance of 
the noise is), the more often the truncation modifies the initial ordinary 
least squares estimator, and the more improvements we get from the min- 
max truncated estimator, which also becomes much more robust than the 
ordinary least squares estimator (see the confidence intervals in the tables). 
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Finally, we have also tested more traditional methods in robust regression, 
namely, the M-estimators with Huber's loss, Li-loss and Tukey's bisquare 
influence function, and also the least trimmed squares estimator, the S- 
estimator and the MM-estimator (see [9, 13] and the references within). 
These methods rely on diminishing the influence of points having "unrea- 
sonably" large residuals. They were developed to handle training sets con- 
taining true outliers, that is, points {X, Y) not generated by the distribu- 
tion P. This is not the case in our estimation framework. By overweighting 
points having reasonably small residuals, these methods are often biased 
even in settings where the noise is symmetric and the regression function 
/(reg) . 3. ^ E[y|X = x] belongs to Fi^^ (i.e., /('""s) = /j*^), and also even when 
there is no noise (but /('"'^s) ^ /lin)- 

The worst results were obtained by the Li-loss, since estimating the 
(conditional) median is here really different from estimating the (condi- 
tional) mean. The MM-estimator and the M-estimators with Huber's loss 
and Tukey's bisquare influence function give good results as long as the 
signal-to-noise ratio is low. When the signal-to-noise ratio is high, a lack 
of consistency drastically appears in part of our simulations, showing that 
these methods are thus not suited for our estimation framework. 

The S-estimator is almost consistently improving on the ordinary least 
squares estimator (in our simulations). However, when the signal-to- noise ra- 
tio is low (i.e., in the setting of the aforementioned simulations with a = 10), 
the improvements are much less significant than the ones of the min-max 
truncated estimator. 

4. Main ideas of the proofs. The goal of this section is to explain the 
key ingredients appearing in the proofs which both allow to obtain subexpo- 
nential tails for the excess risk under a nonexponential moment assumption 
and get rid of the logarithmic factor in the excess risk bound. 

4.1. Suhexponential tails under a nonexponential moment assumption via 
truncation. Let us start with the idea allowing us to prove exponential 
inequalities under just a moment assumption (instead of the traditional 
exponential moment assumption). To understand it, we can consider the 
(apparently) simplistic 1-dimensional situation in which we have = R 
and the marginal distribution of ipi{X) is the Dirac distribution at 1. In 
this case, the risk of the prediction function Jq is R{fe) = IE[(1^ — ^)^] = 
E[(y — Ey)^] + (Ey — 0)^, so that the least squares regression problem boils 
down to the estimation of the mean of the output variable. If we only as- 
sume that Y admits a finite second moment, say, E(y^) < 1, it is not clear 
whether for any e > 0, it is possible to find 9 such that, with probability at 
least 1 — 2e, 





n 
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for some numerical constant c. Indeed, from Chebyshev's inequality, the 
trivial choice = ^ Y17=i ^ J^^^ satisfies, with probability at least 1 — 2e, 

R{fs)-R{n<^^, 

which is far from the objective (4.1) for small confidence levels [consider 
e = exp(— -y/n), e.g.]. The key idea is thus to average (soft) truncated values 
of the outputs. This is performed by taking 



i=l ^ ' 



with A = \/ ^^°^*'^ — ^. Since we have 



/ \2 
logEexp(nA^) = nlogfl + AE(y) + —E{Y'^) 

\2 

<nAE(y)+ny, 

the exponential Chebyshev's inequality guarantees that with probability at 
least 1 — e, we have n\{d - E(y)) < + log(e-^), hence. 



e-E{Y) < 



21og(e-i: 



n 

Replacing Y by —Y in the previous argument, we obtain that, with proba- 
bility at least 1 — e, we have 



,a{E(r) + ±|:i„g(i-Ar. + ^)} 



A^ 

< ny + log(e~^). 



Since -log(l + x + xy2) < log(l - x + x^/2), this implies E(y) - 

e < The two previous inequalities imply inequality (4.1) (for 

c = 2), showing that subexponential tails are achievable even when we only 
assume that the random variable admits a finite second moment (see [5] for 
more details on the robust estimation of the mean of a random variable). 

4.2. Localized PAC-Bayesian inequalities to eliminate a logarithm factor. 
Let us first recall that the Kullback-Leibler divergence between distribu- 
tions p and /i defined on T is 



if p ^ p. 



(4.2) ir(p,p)4p^/-^l°S 

I, +00, otherwise, 

where ^ denotes as usual the density of p w.r.t. p. For any real-valued 
(measurable) function h defined on such that j exp[/i(/)]7r((i/) < +oo, we 
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define the distribution vr/i on T by its density: 
^4 3^ — (/)- *^^P[^^-^)] 



dvr^^^ /exp[/i(/0]7r(d/')' 

The analysis of statistical inference generally relies on upper bounding the 
supremum of an empirical process x indexed by the functions in a model T . 
Concentration inequalities appear as a central tool to obtain these bounds. 
An alternative approach, called the PAC-Bayesian one, consists in using the 
entropic equality 

(4.4) Eexp(^sup |y />(4f)x(/) - ^(/5,vr')}) = ^ 7r'(d/)Eexp(x(/)), 

where ^A is the set of probability distributions on T . 

Let f : — )■ R be an observable process such that, for any / G J^, we have 

Eexp(x(/))<1 

for x(/) = A[i?(/) - f(/)] and some A > 0. Then (4.4) leads to, for any e > 0, 
with probability at least 1 — e, for any distribution p on T ^ we have 

K( p,7rO + log(£-^) 
A 



(4.5) p{df)R{f)< p{dmf) + 



The left-hand side quantity represents the expected risk with respect to the 
distribution p. To get the smallest upper bound on this quantity, a natural 
choice of the (posterior) distribution p is obtained by minimizing the right- 
hand side, that is, by taking p = vr^^. [with the notation introduced in (4.3)]. 
This distribution concentrates on functions f ^ J- for which f(/) is small. 
Without prior knowledge, one may want to choose a prior distribution vr' = 
vr which is rather "flat" (e.g., the one induced by the Lebesgue measure 
in the case of a model J- defined by a bounded parameter set in some 
Euclidean space). Consequently, the Kullback-Leibler divergence K{p,Tr'), 
which should be seen as the complexity term, might be excessively large. 

To overcome the lack of prior information and the resulting high complex- 
ity term, one can alternatively use a more "localized" prior distribution. Here 
we use Gaussian distributions centered at the function of interest (e.g., the 
function /*), and with covariance matrix proportional to the inverse of the 
Gram matrix Q. The idea of using PAC-Bayesian inequalities with Gaussian 
prior and posterior distributions goes back to Langford and Shawe- Taylor [7] 
in the context of linear classification. 

The detailed proofs of Theorems 2.1, 2.2 and 3.1 can be found in the 
supplementary material [2]. 



APPENDIX: EXPERIMENTAL RESULTS FOR THE MIN~MAX 
TRUNCATED ESTIMATOR (SECTION 3.3) 



Table 1 

Comparison of the min-max truncated estimator f with the ordinary least squares estimator for the mixture noise (see 

Section 3.4-1) with p = 0.1 and p — 0.005. In parenthesis, the QbVo- confidence intervals for the estimated quantities 



Nb of 
iterations 



Nb of iter, with 



Nb of iter, with 

fl(/)<fi(/^°'=') 



ER(/<°'=>)-fl(/*) EH(/)-R(/*) 



ER[(/(°'=))|/V/'°"=^] 

-RW) 



E[fl(/)|/V/<°'=^] 

-RW) 



INC (n = 200,d=l) 


1,000 


419 


405 


0, 


.567 


(±0.083) 


0, 


.178 


(±0.025) 


1 


.191 


(±0.178) 


0, 


.262 


(±0.052) 


INC (n = 200,d = 2) 


1,000 


506 


498 


1 


.055 


(±0.112) 


0, 


.271 


(±0.030) 


1 


.884 


(±0.193) 


0, 


.334 


(±0.050) 


HCC (n = 200,d = 2) 


1,000 


502 


494 


1 


.045 


(±0.103) 


0, 


.267 


(±0.024) 


1 


.866 


(±0.174) 


0, 


.316 


(±0.032) 


TS (n = 200,(i = 2) 


1,000 


561 


554 


1 


.069 


(±0.089) 


0, 


.310 


(±0.027) 


1 


.720 


(±0.132) 


0, 


.367 


(±0.036) 


INC (n=l,000,d = 2) 


1,000 


402 


392 


0, 


.204 


(±0.015) 


0, 


.109 


(±0.008) 


0, 


.316 


(±0.029) 


0, 


.081 


(±0.011) 


INC (n= 1,000, d = 10) 


1,000 


950 


946 


1 


.030 


(±0.041) 


0, 


.228 


(±0.016) 


1 


.051 


(±0.042) 


0, 


.207 


(±0.014) 


HCC (n = l,000,d=10) 


1,000 


942 


942 


0, 


.980 


(±0.038) 


0, 


.222 


(±0.015) 


1 


.008 


(±0.039) 


0, 


.203 


(±0.015) 


TS (n = l,000,d=10) 


1,000 


976 


973 


1 


.009 


(±0.037) 


0, 


.228 


(±0.017) 


1 


.018 


(±0.038) 


0, 


.217 


(±0.016) 


INC (n = 2,000, d = 2) 


1,000 


209 


207 


0, 


.104 


(±0.007) 


0, 


.078 


(±0.005) 


0, 


.206 


(±0.021) 


0, 


.082 


(±0.012) 


HCC (ri = 2,000,d = 2) 


1,000 


184 


183 


0, 


.099 


(±0.007) 


0, 


.076 


(±0.005) 


0, 


.196 


(±0.023) 


0, 


.070 


(±0.010) 


TS (n = 2,000,d = 2) 


1,000 


172 


171 


0, 


.101 


(±0.007) 


0, 


.080 


(±0.005) 


0, 


.206 


(±0.020) 


0, 


.083 


(±0.012) 


INC (n = 2,000, ci = 10) 


1,000 


669 


669 


0, 


.510 


(±0.018) 


0, 


.206 


(±0.012) 


0, 


.572 


(±0.023) 


0, 


.117 


(±0.009) 


HCC (n = 2,000,d=10) 


1,000 


669 


669 


0, 


.499 


(±0.018) 


0, 


.207 


(±0.013) 


0, 


.561 


(±0.023) 


0, 


.125 


(±0.011) 


TS (n = 2,000,d=10) 


1,000 


754 


753 


0, 


.516 


(±0.018) 


0, 


.195 


(±0.013) 


0, 


.558 


(±0.022) 


0, 


.131 


(±0.011) 



> 
a 

M 

H 

> 
'Z 
O 

P 

o 

o 

2 



Table 2 

Comparison of the min-max truncated estimator f with the ordinary least squares estimator for the mixture noise (see 

Section 3.4-1) with p = 0.4 and p — 0.005. In parenthesis, the QbVo- confidence intervals for the estimated quantities 





Nb of 


Nb of iter, with 


Nb of iter, with 














E[fl(/)|/V/<°"=^] 




iterations 


H(/)7i-R(/<°'=>) 


fi(/)<H(/<°'=') 


EH(/<°' 


->)-«(/*) 


EH(/)-R(/*) 


-R(r 


) 


-RW) 


INC (n = 200,d=l) 


1,000 


234 


211 


0.551 


(±0.063) 


0.409 


(±0.042) 


1.211 


(±0.210) 


0.606 (±0.110) 


INC (n = 200,d = 2) 


1,000 


195 


186 


1.046 


(±0.088) 


0.788 


(±0.061) 


2.174 


(±0.293) 


0.848 (±0.118) 


HCC (n = 200,d = 2) 


1,000 


222 


215 


1.028 


(±0.079) 


0.748 


(±0.051) 


2.157 


(±0.243) 


0.897 (±0.112) 


TS (n = 200,(i = 2) 


1,000 


291 


268 


1.053 


(±0.079) 


0.805 


(±0.058) 


1.701 


(±0.186) 


0.851 (±0.093) 


INC (n=l,000,d = 2) 


1,000 


127 


117 


0.201 


(±0.013) 


0.181 


(±0.012) 


0.366 


(±0.053) 


0.207 (±0.035) 


INC (n= 1,000, d = 10) 


1,000 


262 


249 


1.023 


(±0.035) 


0.902 


(±0.030) 


1.238 


(±0.081) 


0.777 (±0.054) 


HCC (n = l,000,d=10) 


1,000 


201 


192 


0.991 


(±0.033) 


0.902 


(±0.031) 


1.235 


(±0.088) 


0.790 (±0.067) 


TS (n = l,000,d=10) 


1,000 


171 


162 


1.009 


(±0.033) 


0.951 


(±0.031) 


1.166 


(±0.098) 


0.825 (±0.071) 


INC (n = 2,000, d = 2) 


1,000 


80 


77 


0.105 


(±0.007) 


0.099 


(±0.006) 


0.214 


(±0.042) 


0.135 (±0.029) 


HCC (ri = 2,000,d = 2) 


1,000 


44 


42 


0.102 


(±0.007) 


0.099 


(±0.007) 


0.187 


(±0.050) 


0.120 (±0.034) 


TS (n = 2,000,d = 2) 


1,000 


47 


47 


0.101 


(±0.007) 


0.099 


(±0.007) 


0.147 


(±0.032) 


0.103 (±0.026) 


INC (n = 2,000, ci = 10) 


1,000 


116 


113 


0.511 


(±0.016) 


0.491 


(±0.016) 


0.611 


(±0.052) 


0.437 (±0.042) 


HCC (n = 2,000,d=10) 


1,000 


110 


105 


0.500 


(±0.016) 


0.481 


(±0.015) 


0.602 


(±0.056) 


0.430 (±0.044) 


TS (n = 2,000,d=10) 


1,000 


101 


98 


0.511 


(±0.016) 


0.499 


(±0.016) 


0.601 


(±0.054) 


0.486 (±0.051) 
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Table 3 

Comparison of the mm-max truncated estimator f with the ordinary least squares estimator f^°^^^ with the heavy-tailed noise (see 

Section 3.4.1) 



Nb of Nb of iter, with Nb of iter, with 

iterations R(f)^R{f<^°'''>) R(/)< R(/(°'=> ) 



EH(/<°'=>)-H(/*) EH(/)-R(/*) 



ER[(/(°'=)) 



\f^f 



E[fl(/)|/V/<°'=)] 
-«(/*) 



INC (n = 200,d=l) 


1,000 


163 


145 


7.72 


(±3.46) 


3 


.92 


(±0.409) 


30.52 


(±20.8) 


7.20 


(±1.61) 


INC (n = 200,d = 2) 


1,000 


104 


98 


22.69 


(±23.14) 


19 


.18 


(±23.09) 


45.36 


(±14.1) 


11.63 


(±2.19) 


HCC (n = 200,d = 2) 


1,000 


120 


117 


18.16 


(±12.68) 


8, 


.07 


(±0.718) 


99.39 


(±105) 


15.34 


(±4.41) 


TS (n = 200,(i = 2) 


1,000 


110 


105 


43.89 


(±63.79) 


39 


.71 


(±63.76) 


48.55 


(±18.4) 


10.59 


(±2.01) 


INC (n=l,000,d = 2) 


1,000 


104 


100 


3.98 


(±2.25) 


1, 


.78 


(±0.128) 


23.18 


(±21.3) 


2.03 


(±0.56) 


INC (n= 1,000, d = 10) 


1,000 


253 


242 


16.36 


(±5.10) 


7, 


.90 


(±0.278) 


41.25 


(±19.8) 


7.81 


(±0.69) 


HCC (n = l,000,d=10) 


1,000 


220 


211 


13.57 


(±1.93) 


7, 


.88 


(±0.255) 


33.13 


(±8.2) 


7.28 


(±0.59) 


TS (n = l,000,d=10) 


1,000 


214 


211 


18.67 


(±11.62) 


13, 


.79 


(±11.52) 


30.34 


(±7.2) 


7.53 


(±0.58) 


INC (n = 2,000, d = 2) 


1,000 


113 


103 


1.56 


(±0.41) 


0, 


.89 


(±0.059) 


6.74 


(±3.4) 


0.86 


(±0.18) 


HCC (ri = 2,000,d = 2) 


1,000 


105 


97 


1.66 


(±0.43) 





.95 


(±0.062) 


7.87 


(±3.8) 


1.13 


(±0.23) 


TS (n = 2,000,d = 2) 


1,000 


101 


95 


1.59 


(±0.64) 





.88 


(±0.058) 


8.03 


(±6.2) 


1.04 


(±0.22) 


INC (n = 2,000, ci = 10) 


1,000 


259 


255 


8.77 


(±4.02) 


4, 


.23 


(±0.154) 


21.54 


(±15.4) 


4.03 


(±0.39) 


HCC (n = 2,000,d=10) 


1,000 


250 


242 


6.98 


(±1.17) 


4, 


.13 


(±0.127) 


15.35 


(±4.5) 


3.94 


(±0.25) 


TS (n = 2,000,d=10) 


1,000 


238 


233 


8.49 


(±3.61) 


5 


.95 


(±3.486) 


14.82 


(±3.8) 


4.17 


(±0.30) 
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Table 4 

Comparison of the mm~max truncated estimator f with the ordinary least squares estimator /'"'^^ with the asymmetric heavy-tailed noise 

(see Section 3.4-1) 





Nb of 


Nb of iter, with 


Nb of iter, with 






Efi[(/(°'=))|/V/<°'=^] 


E[fl(/)|/V/<°'=^] 




iterations 




fi(/)<H(/<°'=') 


EK(/(°i=))_H(/«) 


EH(/)-R(/*) 


-Rif) 


-«(/*) 


INC (n = 200,d=l) 


1,000 


87 


77 


5.49 (±3.07) 


3.00 (±0.330) 


35.44 (±34.7) 


6.85 


(±2.48) 


INC (n = 200,d = 2) 


1,000 


70 


66 


19.25 (±23.23) 


17.4 (±23.2) 


37.95 (±13.1) 


11.05 


(±2.87) 


HCC (n = 200,d = 2) 


1,000 


67 


66 


7.19 (±0.88) 


5.81 (±0.397) 


31.52 (±10.5) 


10.87 


(±2.64) 


TS (n = 200,(i = 2) 


1,000 


76 


68 


39.80 (±64.09) 


37.9 (±64.1) 


34.28 (±14.8) 


9.21 


(±2.05) 


INC (n=l,000,d = 2) 


1,000 


101 


92 


2.81 (±2.21) 


1.31 (±0.106) 


16.76 (±21.8) 


1.88 


(±0.69) 


INC (n= 1,000, d = 10) 


1,000 


211 


195 


10.71 (±4.53) 


5.86 (±0.222) 


29.00 (±21.3) 


6.03 


(±0.71) 


HCC (n = l,000,d=10) 


1,000 


197 


185 


8.67 (±1.16) 


5.81 (±0.177) 


20.31 (±5.59) 


5.79 


(±0.43) 


TS (n = l,000,d=10) 


1,000 


258 


233 


13.62 (±11.27) 


11.3 (±11.2) 


14.68 (±2.45) 


5.60 


(±0.36) 


INC (n = 2,000, d = 2) 


1,000 


106 


92 


1.04 (±0.37) 


0.64 (±0.042) 


4.54 (±3.45) 


0.79 


(±0.16) 


HCC (ri = 2,000,d = 2) 


1,000 


99 


90 


0.90 (±0.11) 


0.66 (±0.042) 


3.23 (±0.93) 


0.82 


(±0.16) 


TS (n = 2,000,d = 2) 


1,000 


84 


81 


1.11 (±0.66) 


0.60 (±0.042) 


6.80 (±7.79) 


0.69 


(±0.17) 


INC (n = 2,000, ci = 10) 


1,000 


238 


222 


6.32 (±4.18) 


3.07 (±0.147) 


16.84 (±17.5) 


3.18 


(±0.51) 


HCC (n = 2,000,d=10) 


1,000 


221 


203 


4.49 (±0.98) 


2.98 (±0.091) 


9.76 (±4.39) 


2.93 


(±0.22) 


TS (n = 2,000,d=10) 


1,000 


412 


350 


5.93 (±3.51) 


4.59 (±3.44) 


6.07 (±1.76) 


2.84 


(±0.16) 
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Table 5 

Comparison of the min-max truncated estimator f with the ordinary least squares estimator f^°^^^ for standard Gaussian noise 





Nb of 


Nb of iter, with 


Nb of iter, with 








ER[(/<°'-))|/V/'°''=^] 


E[R(/)|/V/<°"=^] 




iterations 




R(/)<R(/<°'=>) 


Eii(/<°' 


->)-R(/*) 


ER(f)-R{f) 


-R(f') 


-R(f') 


INC (n = 200,ci=l) 


1,000 


20 


8 


0.541 


(±0.048) 


0.541 (±0.048) 


0.401 (±0.168) 


0.397 (±0.167) 


INC (n = 200, d = 2) 


1,000 


1 





1.051 


(±0.067) 


1.051 (±0.067) 


2.566 


2.757 


HCC (n = 200, d = 2) 


1,000 


1 





1.051 


(±0.067) 


1.051 (±0.067) 


2.566 


2.757 


TS (n = 200,d = 2) 


1,000 








1.068 


(±0.067) 


1.068 (±0.067) 






INC (n = l,000,(i = 2) 


1,000 








0.203 


(±0.013) 


0.203 (±0.013) 






INC (n = 1,000, (i = 10) 


1,000 








1.023 


(±0.029) 


1.023 (±0.029) 






HCC (n = l,000,d = 10) 


1,000 








1.023 


(±0.029) 


1.023 (±0.029) 






TS (n = l,000,d=10) 


1,000 








0.997 


(±0.028) 


0.997 (±0.028) 






INC (n = 2,000,d = 2) 


1,000 








0.112 


(±0.007) 


0.112 (±0.007) 






HCC (ri = 2,000,d = 2) 


1,000 








0.112 


(±0.007) 


0.112 (±0.007) 






TS (n = 2,000,d = 2) 


1,000 








0.098 


(±0.006) 


0.098 (±0.006) 






INC (n = 2,000, ci = 10) 


1,000 








0.517 


(±0.015) 


0.517 (±0.015) 






HCC (n = 2,000,d=10) 


1,000 








0.517 


(±0.015) 


0.517 (±0.015) 






TS (n = 2,000,d=10) 


1,000 








0.501 


(±0.015) 


0.501 (±0.015) 
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Fig. 1. Circled points are the points of the training set generated several times from 
2^5(1,000,10) (with the mixture noise with p = 0.005 and p = 0.4) that are not taken into 
account in the min-max truncated estimator (to the extent that the estimator would not 
change by removing simultaneously all these points). The min-max truncated estimator 
X I— f{x) appears in dash-dot line, while x M- EC^IX = a;) is in solid line. In these six 
simulations, it outperforms the ordinary least squares estimator. 
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Fig. 2. Circled points are the points of the training set generated several times from 
TS{200,2) (with the heavy-tailed noise) that are not taken into account in the min — 
max truncated estimator (to the extent that the estimator would not change by removing 
these points). The min-max truncated estimator x i— >■ f{x) appears in dash-dot line, while 
X 1—^ E(y|X = x) is in solid line. In these six simulations, it outperforms the ordinary least 
squares estimator. Note that in the last figure, it does not consider 64 points among the 
200 training points. 
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SUPPLEMENTARY MATERIAL 

Supplement to "Robust linear least squares regression" 

(DOL 10.1214/11-AOS918SUPP; .pdf). The supplementary material pro- 
vides the proofs of Theorems 2.1, 2.2 and 3.1. 
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